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We have recently developed a scalable algorithm for ordering the instantaneous observations of a dynamical 
system evolving continuously in time. Here, we apply the method to long molecular dynamics trajectories. 
The procedure requires only a pairwise, geometrical distance as input. Suitable annotations of both 
structural and kinetic nature reveal the free energy basins visited by biomolecules. The profile is 
supplemented by a trace of the temporal evolution of the system highlighting the sequence of events. We 
demonstrate that the resultant SAPPHIRE (States And Pathways Projected with High REsolution) plots 
provide a comprehensive picture of the thermodynamics and kinetics of complex, molecular systems 
exhibiting dynamics covering a range of time and length scales. Information on pathways connecting states 
and the level of recurrence are quickly inferred from the visualisation. The considerable advantages of our 
approach are speed and resolution: the SAPPHIRE plot is scalable to very large data sets and represents every 
single snapshot. This minimizes the risk of missing states because of overlap or prior coarse-graining of the 
data. 

Present day science or more broadly society record observations as a function of time in diverse contexts'. 
Data on meteorological phenomena, communication tracking, or financial markets, to name a few, are all 
mined for the generation of predictive models. The raw data are usually unfit for human consumption due 
to their high dimensionality and sheer size^"". Both aspects limit the types of analyses performed on these "big 
data" to those algorithms with satisfactory scaling properties'*. In biophysics, long computer simulations of the 
trajectories of complex macromolecules with high-dimensional representations have become commonplace^, and 
this is where our particular interest lies'". 

Molecular dynamics (MD) simulations of proteins and other biomolecules' record stochastic trajectories, in 
which the macromolecule visits a number of different, metastable states (free energy basins) connected by an 
ensemble of pathways of interconversion. The latter report on the barriers of the underlying free energy land- 
scape". Because millions of snapshots are now routinely recorded for thousands of coupled degrees of freedom'', 
MD trajectories call for scalable algorithms that are able to provide information-preserving projections for this 
specific class of complex systems. We have recently introduced such an algorithm' and provide a brief description 
next. 

Given a definition of distance between trajectory snapshots, the entire data set is considered as a complete 
graph with vertices corresponding to snapshots and edge weights given by the pairwise distances between snap- 
shots. Either the exact or an approximation to the minimum spanning tree are computed. From a generally 
arbitrary starting point, the available edge with smallest weight is followed to define a sequence of snapshots, the 
so-called progress index. The available edges at each point are those connecting any snapshot not considered yet 
(we refer to this set of snapshots as A) with any snapshot already included (the set S). The resulting sequence has 
the crucial property of stepping through high density regions one by one. It can therefore be expected that all free 
energy basins will appear as groups of nearby points along the progress index. Importantly, the progress index 
does not reflect the temporal nature of the input data in any way, i.e., it is generally independent of input order. 
Because every snapshot is considered, the limiting resolution is optimal given the time resolution of the input 
trajectory. 

The progress index can be annotated both kinetically and structurally to provide an informative and compact 
representation of all major states visited by the input trajectory. The procedure has several advantages over 
projections using geometric or kinetic distances from a reference state to order snapshots. First, it maximizes 
resolution as mentioned above. Second, it avoids overlap precisely because the ordering is not with respect to a 
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Figure 1 | SAPPHIRE plot for FiP35. (a) The progress index, of 10' snapshots from 200 |j.sofMD data, is annotated with kinetic information (tmfps black 
curve), dynamical trace (red dots), DSSP assignment" by residue (legend on top) and the state partitioning of Berezovska et al}° These annotations are 
only shown for every 1000"", 100"", lOOO"* and SOO"" snapshots, respectively, in order to maintain readability at fixed figure resolution. The limits of possible 
definitions of the folded and unfolded states for the computation of transition path times are indicated by the blue, horizontal lines. Cartoons'' of a 
snapshot in the native state and an unfolded conformation are shown, (b) Zoom-in on the transition region of the SAPPHIRE plot shown in (a). The 
various annotations are shown for every 100"", 10"", 50"", and 250"" snapshots, respectively. Representative conformations of II and 12 are shown as 
cartoons. The box highlights a particular state (see text). 



particular state. Third, it is easy to use, scalable, and requires a notion 
of distance between snapshots as the only "parameter". Like most 
data mining methods exploiting pairwise similarity as a guide, e.g., 
clustering'", it requires sufficient sampling density. The sampling 
weights of individual basins can in general be resolved quantitatively. 

In the present, short contribution, we apply the method of 
Blochliger et al? to two molecular dynamics trajectories of proteins, 
which were produced using dedicated hardware". We annotate the 
plot threefold, viz., structurally, kinetically, and with times of occur- 
rence in the original trajectories (called dynamical trace hereafter). 
We demonstrate that the information summarized in the resultant 
SAPPHIRE (States And Pathways Projected with High REsolution) 
plot provides an efficient means of identifying the statistically reliable 
states visited by a complex, dynamical system while enabling a rapid 
assessment of state interconversion and recurrence, which provide 
information on kinetic pathways and simulation convergence, 
respectively. 

Results 

We present results on two different proteins. The data on Fip35'^, a 
small WW domain, come from two long MD trajectories and 
describe reversible transitions of this peptide between the folded 
state, a three-stranded P-sheet, and a coU-like unfolded state. The 
single MD trajectory obtained for the 58-residue bovine pancreatic 
trypsin inhibitor (BPTI), a protein with a mixed a/p fold'^', exhibits 
few transitions between distinct folded states that differ prominently 
in the isomerization states of disulphide bridges. We analyse both of 
these data sets with SAPPHIRE plots. The general annotation func- 
tions we use are as follows: 

1. The sets A and S allow us to stipulate a two-state Markov state 
model, and we can derive the mean first passage times in either 
direction". A cut function as used elsewhere'* allows an analyt- 



ical evaluation. We define the average of the two values as Tmfp- 
This kinetic annotation is expected to highlight barriers reliably 
with the caveat that it cannot be interpreted quantitatively due to 
the simplicity of the two-state model. For data sets obtained by 
concatenating many short MD trajectories, the cut function is 
adjusted to ignore the spurious transitions at the break points 
between two trajectories. 

2. The actual time of occurrence in the input data (dynamical trace) 
is plotted for each snapshot as an annotation highlighting direct 
transitions between states (pathways). Because the progress 
index is expected to be free of overlap, this allows a straightfor- 
ward assessment of recurrence. This annotation is less inform- 
ative if the data set is a concatenation of short trajectories where 
each continuous segment visits only one or few basins. 

3. States themselves are characterized by a structural annotation. 
This is necessarily system-specific and requires prior knowledge 
of the system and data. An informative, geometric annotation 
can be exceptionally helpful in connecting the states identified 
by the kinetic annotation with a structural interpretation fit for 
human consumption. Structural annotations do not depend on 
input order, i.e., they are useful even for unordered input data. 

Reversible folding of a 35-residue protein domain. FiP35'^ exhibits 
reversible folding at a simulation temperature of 395 K in explicit 
solvent molecular dynamics runs of a total length of 200 
Specifically, the trajectories show that FiP35 converts 10-15 times 
between an unfolded state that is very low in secondary structure 
content and the native topology, viz., a twisted, three-stranded 
P-sheet"''^'"". All following results refer to a specific computational 
model and sampling protocol" underlying the trajectories being 
analysed. Due to the protein's small size, it is possible to provide a 
comprehensive, structural representation at the backbone level using 
a DSSP annotation'^ resolved by residue. Fig. 1(a) shows the 
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SAPPHIRE plot for the composite trajectory using this annotation, 
and it is immediately apparent that the native topology is observed 
more than 50% of the time. The native basin is delineated by the 
kinetic annotation (black line) as expected. The unfolded state shows 
no consistent secondary structure and is kinetically homogeneous, 
suggesting that FiP35 should be described well as a two-state folder. 

Fig. 1(b) highlights one of the major advantages of our approach, 
i.e., its high resolution. Here, we zoom in on the transition region. 
Previous analyses of the same data suggested the existence of at least 
two intermediates"*^^", and we have additionally annotated the 
SAPPHIRE plot with the state partitioning proposed by 
Berezovska et al.^" Referring to their Fig. 3, we denote the larger 
and smaller of the two unlabelled states as L and S, respectively. By 
also taking into account the dynamical trace. Fig. 1 allows us to 
quickly extract the following results: 

• 12 is identified with a very homogeneous state sampled exten- 
sively only once during the 200 |.is and characterized by a 
three-stranded P-sheet topology with shifted registry for the N- 
terminal hairpin (see DSSP annotation in Fig. 1(b)). It is referred 
to as a kinetic trap elsewhere"', and its sampling weight is 0.5- 
1.0%. 

• In state II, only the N-terminal hairpin is formed with the C- 
terminus largely coil-like. This is the state sampled most often 
when converting between folded and unfolded states (F and U, 
respectively) via an intermediate, and its weight is —2.5%. 

• Structurally, L consists largely of a boundary (barrier) region 
between II and U. Our data suggest that the kinetically and geo- 
metrically homogeneous state highlighted by the black box in 
Fig. 1(b) should have been separated out. 

• States U and F are explored for extended periods of time less than 
10 times each. Several excursions into intermediate states are 
unproductive. 

• We cannot assign obvious meaning to state S. 

• Fig. 1(a) suggests the existence of an additional state with a weight 
of —5%, in which the N-terminal turn is in an alternative con- 
formation. Similar differences are observed when comparing 
NMR structures of apo and holo forms of related WW domains^'. 
While kinetically distinct, this state may have been ignored by 
Berezovska et al. because it is not on-pathway. Indeed, it is likely 
to correspond to the state labelled holo by Lane et al.^"^. 

The picture emerging from the above is overall congruent with 
analyses of the same data reported elsewhere""'"* This also 

extends to the pathway information regarding dominant routes of 
folding. The main advantage of carefully constructed Markov state 
models is of course that probability flux and time scales are explicit 
and quantitative. This bears the caveat that the required lag times 
may be so large that useful information on or below the timescale of 
this lag time is lost. Conversely, Fig. 1 allows many of the important 
conclusions arrived at in the literature from a single plot that can be 
produced in near-linear time with respect to the number of snap- 
shots. Some of the kinetic information such as probability flux and 
pathways are qualitative in nature only. It maybe necessary to rescale 
the plot to resolve some of the finer details. It is also important to 
keep in mind that the sequence from left to right does not correspond 
to real pathways taken by the trajectory even though it may some- 
times appear that way. Pathway information is gleaned exclusively 
from the dynamical trace, which is read vertically from bottom to 
top. 

The SAPPHIRE plot allows a very straightforward grouping of 
snapshots into states. From these groupings, we can compute further 
quantities such as the times taken to reach the folded state from the 
unfolded state and vice versa (transition path times). Fig. 1(a) indi- 
cates two extreme definitions of folded and unfolded states, and 
within these limits the computed transition path times range from 
20 to 180 ns. Experiments suggest more roughness of the underlying 



landscape leading to longer transition times ( ~ 1 |.is)^''. We reempha- 
size that Fig. 1 is conditional upon a specific model, i.e., force field, 
and sampling protocol, which in all likelihood are prone to both 
systematic and statistical errors. We merely perform the analysis here 
to demonstrate how the SAPPHIRE plot is also an excellent starting 
point for further efforts in characterizing and understanding the data 
and system at hand. 

Native state dynamics of a folded protein. Analyses of MD 
simulations of the folded state ensemble of the 58-residue bovine 
pancreatic trypsin inhibitor (BPTI) have revealed that a number of 
states identified by NMR experiments^^'^*" are populated significantly 
in the trajectories albeit with inaccurate weights. These metastable 
states can often be correlated with isomerizations of the disulphide 
bridges, in particular CysM-CysSS^''^". Shaw et aZ." used a stochastic 
algorithm to obtain a coarse, kinetic clustering of their 1.03 ms MD 
trajectory of BPTI relying on the autocorrelation of interatomic 
distances. Empirically, they found that five states with significant 
populations could be identified reliably. These states were 
annotated structurally. The most important states (the smaller of 
the two is the one most resembling the crystal structure) are clearly 
seen in the SAPPHIRE plot as the first two basins from the left 
(Fig. 2). The structural annotation we select here confirms that the 
barrier identified by the kinetic analysis (black line) is related to the 
isomerization of the disulphide bond Cysl4-Cys38. The dynamical 
trace uses the colour scheme of Shaw et al. (distinguishing the red, 
blue, green, purple, and black states). It unmasks that both major 
states are long-lived and that there is a clear separation of time scales 
with respect to the mixing time within each basin. 

Fig. 2 indicates that there is a mismatch in assignment between 
that of Shaw et al. and the positions on the x-axis for several, short 
excursions into a given state. As an example, we consider the high- 
lighted trajectory segment sampled at —0.5 ms that is annotated by 
Shaw et al. to be in the red state but that is placed by the SAPPHIRE 
plot in the basin corresponding to the blue state. To understand why 
this may be the case, we first note that the structural annotations 
generally reveal a small amount of mixing that may be considered 
erroneous. Indeed, for the segment in question, inspection of instant- 
aneous values yields that the Cysl4 side chain angles adopt the values 
for the blue state, but the 5(3 angle and the X2 angle of Cys38 do not 
(not shown). The combination of values for the dihedral angles 
places this segment outside of the list of states characterized prev- 
iously^". It appears kinetically homogeneous and may correspond to 
an incomplete or blocked transition. Its sampling weight is so low 
that neither the SAPPHIRE plot nor the kinetic clustering are sens- 
itive enough to resolve it as an independent state. Due to its inter- 
mediate nature, it is lumped into either one of the adjacent states. A 
very similar effect is observed for a second, highlighted segment (at 
—0.75 ms), for which just the two Cys38 side chain angles deviate 
from the blue state. 

The SAPPHIRE plot for BPTI also reveals that over the course of 
the 1.03 ms trajectory the purple and black states are sampled exten- 
sively just once and twice, respectively. This allows us to infer a lack 
of recurrence, i.e., sampling weights are unlikely to be converged. 
Poor sampling may also limit the number of states obtainable from 
Markov state models™ and decrease the accuracy of any extracted 
passage times. The bottom panel of Fig. 2 zooms into a very thin time 
slice to illustrate the pathway taken to reach the black state. This is 
annotated by cartoons and a specific, interatomic distance involving 
a residue identified by the original authors as being discriminative for 
this state". The final result we want to mention in this short note is 
that the SAPPHIRE plot suggests the green state to be partitioned 
further. The kinetic annotation is consistent with the dynamical trace 
in that the two major substates of the green state are homogeneous 
with respect to the times they were sampled at (no mixing). This is 
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Figure 2 | SAPPHIRE plot for BPTI. (Upper panel) The progress index, of 41250 snapshots from 1.03 ms of MD data, is annotated with kinetic 
information (tmfp> black curve), dynamical trace (dots coloured according to the kinetic clustering of Shaw etal.)", and selected dihedral angles. These 
annotations are only shown for every 20"*, 2'"' and 2°'' snapshots, respectively, in order to maintain readability at fixed figure resolution. The annotation 
with dihedral angles uses binning into up to three bins with boundaries chosen as follows: Cysl4 Xi (^120°, —5°, 120°), Cysl4 X2 (^140°, 0°, 130°), 
Cysl4-Cys38 Xs (0°, 150°), Cys38 Xi (-155°, - 105°, 120°), Cys38 Xi (-120°, 0°, 140°), Arg42 v|/ (-100°, 75°), and Asp3 ip (0°, 100°). These boundaries 
were obtained from direct inspection of the individual histograms for each angle. Boxes highlight two brief stretches of the trajectory referred to in the text. 
(Lower panel) Zoom-in on a thin time slice of the dynamical trace to visualise a particular transition from the red to the black state. End points of this 
transition are shown as cartoons with Argl and Phe4 in a stick-like representation". The plot is annotated further by the distance between the atom of 
Phe4 and the Cg of Argl, which is shown for every 5"* snapshot. 



despite the fact that they appear to be directly adjacent to one another 
in terms of transition pathways. 

We conclude the description of the performance of the SAPPHIRE 
plot with a note of caution. In Fig. 2, toward the right side of the 
largest basin, there is a region of both temporal and geometric ambi- 
guity most clearly seen by the overlap of blue and red dots in the 
dynamical trace. Here, the progress index is placing "fringe" regions 
of both basins. This weakness results from an insufficient sampling 
density for these lower likelihood regions that immediately surround 
well-defined states. It is rectified by having better time resolution or, 
at the risk of a decrease in resolution, by lowering the dimensionality 
of representation. We show the data on the sparsely sampled traject- 
ory here to illustrate both the general robustness and possible errors 
encountered with smaller data sets. 

Discussion 

With growing computing resources and growing data sets, it has 
become paramount to use tools that quickly and efficiently improve 
our understanding of a system as complex as a biomolecule. The data 
required for the SAPPHIRE plot with all three annotations can usu- 
ally be computed in near-linear time in a single run by the CAMPARI 



simulation and analysis package (http://campari.sourceforge.net). 
The plots are ideally generated as fully scalable vector graphics. At 
fixed resolution, readability may be improved by displaying annota- 
tions more sparsely, and we have done this for both figures. The 
required user input is the definition of a suitable measure of pairwise 
distance, and this choice may also help determine which structural 
annotations to use. 

In Figs. 1 and 2, we have shown that SAPPHIRE plots offer an 
efficient procedure for the analysis and comprehensive pictorial 
description of complex systems undergoing stochastic evolution, 
such as proteins. Thermodynamics are resolved quantitatively, and 
the construction of the ordering of snapshots minimizes the risk of 
state overlap. Major basins are delineated easily by all three annota- 
tion functions. Qualitative information about pathways is available at 
the temporal resolution offered by the trajectory itself The rapid 
availability of this information is not only valuable per se but can 
also be used to guide further simulations and analyses. 

Methods 

The algorithm underlying the SAPPHIRE plot has been describe qualitatively above 
(see Introduction and Results). For a complete description we refer the reader to the 
original publication^. In terms of efficiency, the overall annotation procedure requires 
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linear time with respect to the number of snapshots. The calculation of the required 
spanning tree is the most expensive step of the algorithm and is aided by heuristics in 
either variant (exact or approximate). The approximate version can be scaled to very 
large data sets. When using this version, it will generally be useful to rerun the analysis 
a few times due to the stochastic nature of the spanning tree. In particular, the kinetic 
annotation function is sensitive to where a basin appears in the progress index and 
how well basins to the left have been captured. 

The FiP35 trajectory encompasses 10^ snapshots saved every 200 ps, while the 
41250 snapshots of data on BPTI have a coarser time resolution of 25 ns. Pairwise 
distances were defined as the coordinate root mean square deviation (RMSD) com- 
puted over the backbone oxygen and nitrogen atoms of residues 7-29 for FiP35 and 
over 695 nonsymmetric atoms for BPTI. These choices reflect the different levels of 
variance in the two data sets. The approximate algorithm was used for both systems. It 
requires additional parameters as follows. The number of guesses to find putative 
nearest neighbours from within a limited space defined by preorganization of the data 
via clustering^" was set to a value of 1000 throughout. The lower threshold radii for 
clusters were 3.0 and 2.5A for FiP35 and BPTI, respectively, and the upper threshold 
radii were 10.0 and 3. OA. The required input data took ~ 1 1 and —5 hours to compute 
on a single Intel Xeon core (either E5435 or E5410) for Figs. 1 and 2, respectively. 
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